Just like t-SNE, UMAP has four important hyperparameters that control the resulting embedding: n_neighbors, min_dist, metric & n_epochs.
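To see what UMAP falls back on when one of n_neighbors, min_dist, metric or n_epochs is not set, the defaults can be read from the package's configuration object. This is a minimal sketch, assuming the umap package used in this section is installed; the stored values may differ between package versions.

```r
library(umap)

# umap.defaults is the configuration object umap() uses for any
# hyperparameter that is not passed explicitly.
hyperparameters <- c("n_neighbors", "min_dist", "metric", "n_epochs")

# unclass() turns the config into a plain list so it can be subset & inspected
defaults <- unclass(umap.defaults)[hyperparameters]
str(defaults)
```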
Here’s a visual representation. We’ve run UMAP on our Swiss banknote data using a grid of hyperparameter values. The diagram below shows the final embeddings for different combinations of n_neighbors (rows) & min_dist (columns), using the default values of metric & n_epochs.
Notice that the cases are more spread out for smaller values of n_neighbors & min_dist & that the clusters begin to break apart with low values for the n_neighbors hyperparameter.
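A grid like this could be produced along the following lines. This is only a sketch: the grid values are hypothetical (the values behind the diagram are not stated here), and names such as paramGrid & gridResults are introduced purely for illustration.

```r
library(umap)

data(banknote, package = "mclust")
# Keep only the numeric measurements (UMAP cannot use the categorical Status)
bankMatrix <- as.matrix(banknote[, names(banknote) != "Status"])

# Hypothetical grid values, chosen only for illustration
paramGrid <- expand.grid(n_neighbors = c(5, 15, 50), min_dist = c(0.05, 0.5))

# Fit one embedding per row of the grid & record which settings produced it
embeddings <- lapply(seq_len(nrow(paramGrid)), function(i) {
  fit <- umap(bankMatrix,
              n_neighbors = paramGrid$n_neighbors[i],
              min_dist = paramGrid$min_dist[i])
  data.frame(UMAP1 = fit$layout[, 1],
             UMAP2 = fit$layout[, 2],
             Status = banknote$Status,
             n_neighbors = paramGrid$n_neighbors[i],
             min_dist = paramGrid$min_dist[i])
})
gridResults <- do.call(rbind, embeddings)
```

The stacked gridResults data frame can then be passed to ggplot() & faceted by n_neighbors & min_dist to compare the embeddings side by side.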
The second diagram shows the final embeddings with different combinations of metric (rows) & n_epochs (columns). The effect here is a little more subtle, but the clusters tend to be farther apart with more iterations. It also looks as though Manhattan distance does a slightly better job of breaking up those three smaller clusters (which we’ve not seen before!).
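To make the difference between the two metrics concrete, here is a small base R sketch that computes both distances between two banknotes, using the measurements given for newBanknotes later in this section. It only illustrates the two distance definitions; it is not a view into how UMAP uses them internally.

```r
# Measurements for two banknotes (the newBanknotes values from this section)
notes <- rbind(
  c(Length = 214, Left = 130, Right = 132, Bottom = 12, Top = 12, Diagonal = 138),
  c(Length = 216, Left = 128, Right = 129, Bottom = 7, Top = 8, Diagonal = 142)
)

# Euclidean: square root of the summed squared differences
euclidean <- as.numeric(dist(notes, method = "euclidean"))
# Manhattan: sum of the absolute differences
manhattan <- as.numeric(dist(notes, method = "manhattan"))

euclidean  # sqrt(74), roughly 8.6
manhattan  # 20
```

Because Manhattan distance adds up every per-variable difference without squaring, it weights many small differences relatively more heavily than Euclidean distance does, which may be why it separates the data differently.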
We’ll run UMAP on our Swiss banknote data set. Just like before, we first select all the columns except the categorical variable (UMAP cannot currently handle categorical variables, but this may change in the future) & pipe this data into the as.matrix() function (to prevent an irritating warning message). This matrix is then piped into the umap() function, within which we manually set the values of all four hyperparameters & set the argument verbose = TRUE so the algorithm prints a running commentary on the number of epochs (iterations) that have passed.
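library(tidyverse)  # as_tibble(), select(), %>%
library(umap)

data(banknote, package = 'mclust')
swissTib <- as_tibble(banknote)

# Drop the categorical Status column, convert to a matrix, then fit UMAP
swissUmap <- select(swissTib, -Status) %>%
  as.matrix() %>%
  umap(n_neighbors = 7, min_dist = 0.1, metric = 'manhattan',
       n_epochs = 200, verbose = TRUE)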
## [2023-05-29 12:45:26] starting umap
## [2023-05-29 12:45:26] creating graph of nearest neighbors
## [2023-05-29 12:45:26] creating initial embedding
## [2023-05-29 12:45:26] optimizing embedding
## [2023-05-29 12:45:26] done
We’ll plot the two UMAP dimensions against each other to see how well they separated the genuine & counterfeit banknotes.
swissTibUmap <- swissTib %>%
  mutate_if(.funs = scale, .predicate = is.numeric, scale = FALSE) %>%
  mutate(UMAP1 = swissUmap$layout[, 1], UMAP2 = swissUmap$layout[, 2]) %>%
  gather(key = 'Variable', value = 'Value', c(-UMAP1, -UMAP2, -Status))
## Warning: attributes are not identical across measure variables;
## they will be dropped
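The scale = FALSE argument in the pipeline above means each numeric variable is centred but not divided by its standard deviation. The warning most likely arises because scale() returns a one-column matrix carrying its own "scaled:center" attribute, & those attributes differ between variables, so gather() drops them when stacking the columns. A base R sketch with made-up values shows what scale() returns:

```r
x <- c(214.8, 214.6, 215.0)  # made-up lengths, for illustration only

# Subtract the mean, but do not divide by the standard deviation
centred <- scale(x, scale = FALSE)

is.matrix(centred)              # scale() returns a one-column matrix
attr(centred, "scaled:center")  # the subtracted mean is stored as an attribute

# Stripping the attributes leaves the plain centred values
all.equal(as.numeric(centred), x - mean(x))
```

The warning is therefore harmless here: only the stored means are lost, not the centred values themselves.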
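library(plotly)  # provides ggplotly()

# Colour each point by the (centred) value of the faceted variable
ggplotly(
  ggplot(swissTibUmap, aes(UMAP1, UMAP2, col = Value, shape = Status)) +
    facet_wrap(~ Variable) +
    geom_point(size = 3) +
    scale_colour_gradient(low = 'dark blue', high = 'cyan') +
    theme_bw()
)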
The UMAP embedding seems to suggest the existence of three different clusters of counterfeit banknotes. Perhaps there are three different counterfeiters at large.
Recall that, unlike t-SNE, new data can be projected reproducibly onto a UMAP embedding. We can do this for the newBanknotes tibble we defined in a previous chapter when predicting PCA component scores. In fact, the process is exactly the same: we use the predict() function with the model as the first argument & the new data as the second argument. This outputs a matrix, where the rows represent the two cases & the columns represent the UMAP axes:
newBanknotes <- tibble(
  Length = c(214, 216),
  Left = c(130, 128),
  Right = c(132, 129),
  Bottom = c(12, 7),
  Top = c(12, 8),
  Diagonal = c(138, 142)
)
predict(swissUmap, newBanknotes)
## [2023-05-29 12:45:27] creating graph of nearest neighbors
## [2023-05-29 12:45:27] creating initial embedding
## [2023-05-29 12:45:27] optimizing embedding
## [2023-05-29 12:45:27] done
##        [,1]      [,2]
## 1 -2.657662 -3.239243
## 2 -1.214505  3.379858
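One way to get a feel for predict() is to hold back a few banknotes, fit UMAP on the rest, & project the held-out cases onto that embedding. The sketch below assumes the umap & mclust packages are installed; the seed, the number of held-out cases & the object names are arbitrary choices for illustration, & it uses the default hyperparameters rather than the ones set above.

```r
library(umap)

data(banknote, package = "mclust")
measurements <- banknote[, names(banknote) != "Status"]

set.seed(42)                               # arbitrary seed for the split
holdOut <- sample(nrow(measurements), 10)  # 10 hypothetical "new" banknotes

# Fit on the remaining banknotes, then project the held-out ones
fit <- umap(as.matrix(measurements[-holdOut, ]))
projected <- predict(fit, as.matrix(measurements[holdOut, ]))

dim(projected)  # one row per held-out banknote, one column per UMAP axis
```

Plotting projected on top of fit$layout, coloured by Status, would show whether the held-out banknotes land in the clusters we would expect.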
The strengths of t-SNE & UMAP are as follows:
The weaknesses of t-SNE & UMAP are as follows: